library(data.table)
library(sf)
library(ggplot2)
library(dplyr)
library(viridis)
STA 9750 Individual Final Report
Sabrina Zhu
December 13, 2025
Specific Question: How does station infrastructure relate to e-bike vs classic bike usage?
This analysis explores whether the quality of bike-share infrastructure—measured by station density and proximity to bike lanes—affects which type of bike riders use. The findings contribute to the overarching group question: Within Manhattan, does bike-share usage respond more to infrastructure or to external factors?
# Load the 5% stratified sample
citibike_sample <- readRDS("data/processed/citibike_manhattan_sample_5pct.rds")
routes <- readRDS("data/individual_report/bike_routes.rds")
manhattan_poly <- readRDS("data/gis/manhattan_polygon.rds")
cat("Loaded", format(nrow(citibike_sample), big.mark = ","), "trips\n")
Loaded 1,618,078 trips
Date range: 2024-09-29 to 2025-10-31
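I classified each Citi Bike station by infrastructure quality using two metrics: station density (the number of other stations within 500 m) and bike lane proximity (the distance in meters to the nearest bike lane).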
These were combined into a composite infrastructure score and categorized into Low, Medium, and High levels.
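The station-density metric counts the other stations within 500 m of each station. As a rough cross-check that does not depend on sf, the same count can be sketched in base R with the haversine formula. The `toy_xy` coordinates and the `haversine_m` helper below are hypothetical and only illustrate the idea; they are not part of the original analysis.

```r
# Haversine great-circle distance in metres between lon/lat points
haversine_m <- function(lng1, lat1, lng2, lat2, radius = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlng <- (lng2 - lng1) * to_rad
  a <- sin(dlat / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlng / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

# Hypothetical station coordinates (A, B, C roughly 330 m apart; D far away)
toy_xy <- data.frame(
  station_name = c("A", "B", "C", "D"),
  lng = c(-73.9900, -73.9900, -73.9900, -73.9500),
  lat = c( 40.7500,  40.7530,  40.7560,  40.8000)
)

# Pairwise distance matrix, then count neighbours within 500 m (excluding self)
idx <- seq_len(nrow(toy_xy))
d <- outer(idx, idx, function(i, j) {
  haversine_m(toy_xy$lng[i], toy_xy$lat[i], toy_xy$lng[j], toy_xy$lat[j])
})
toy_xy$nearby_count <- rowSums(d <= 500) - 1
print(toy_xy)
```

A full pairwise matrix is fine for a few hundred stations; for much larger inputs a spatial index (as sf provides) would be the more practical choice.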
Found 734 unique stations
# Convert to spatial object
stations_sf <- st_as_sf(stations, coords = c("lng", "lat"), crs = 4326, remove = FALSE)
# Metric 1: Station Density (count nearby stations within 500m)
stations_buffer <- st_buffer(stations_sf, dist = 500)
stations$nearby_count <- sapply(1:nrow(stations_sf), function(i) {
sum(st_intersects(stations_buffer[i, ], stations_sf, sparse = FALSE)) - 1
})
# Metric 2: Distance to Bike Lanes
manhattan_bbox <- st_bbox(c(xmin = -74.02, xmax = -73.90, ymin = 40.70, ymax = 40.88),
crs = st_crs(4326)) |> st_as_sfc()
manhattan_routes_sf <- routes %>%
st_transform(4326) %>%  # match the bbox CRS before the spatial filter
filter(st_intersects(geometry, manhattan_bbox, sparse = FALSE)[,1])
stations$dist_to_bike_lane_m <- sapply(1:nrow(stations_sf), function(i) {
distances <- st_distance(stations_sf[i, ], manhattan_routes_sf)
min(as.numeric(distances))
})
# Create Infrastructure Score (normalized 0-1)
stations$density_score <- (stations$nearby_count - min(stations$nearby_count)) /
(max(stations$nearby_count) - min(stations$nearby_count))
max_dist <- quantile(stations$dist_to_bike_lane_m, 0.95)
stations$lane_proximity_score <- 1 - pmin(stations$dist_to_bike_lane_m, max_dist) / max_dist
stations$infrastructure_score <- 0.5 * stations$density_score + 0.5 * stations$lane_proximity_score
# Classify into tertiles
tertiles <- quantile(stations$infrastructure_score, probs = c(0.33, 0.67))
stations$infrastructure_level <- cut(
stations$infrastructure_score,
breaks = c(-Inf, tertiles[1], tertiles[2], Inf),
labels = c("Low", "Medium", "High"),
include.lowest = TRUE
)
cat("\nInfrastructure Classification:\n")
Infrastructure Classification:
Low Medium High
242 250 242
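To make the scoring steps concrete, the sketch below applies the same min-max normalization, 95th-percentile distance cap, and equal weighting to five made-up stations. The `toy` data frame and its values are hypothetical and are not drawn from the Citi Bike data.

```r
# Toy illustration of the composite infrastructure score (hypothetical values)
toy <- data.frame(
  station_name        = c("A", "B", "C", "D", "E"),
  nearby_count        = c(2, 5, 8, 11, 14),     # stations within 500 m
  dist_to_bike_lane_m = c(400, 120, 60, 20, 0)  # metres to nearest bike lane
)

# Min-max normalize station density to the 0-1 range
rng <- range(toy$nearby_count)
toy$density_score <- (toy$nearby_count - rng[1]) / (rng[2] - rng[1])

# Cap distances at the 95th percentile, then invert so closer = higher score
max_dist <- quantile(toy$dist_to_bike_lane_m, 0.95, names = FALSE)
toy$lane_proximity_score <- 1 - pmin(toy$dist_to_bike_lane_m, max_dist) / max_dist

# Equal-weight composite of the two component scores
toy$infrastructure_score <- 0.5 * toy$density_score + 0.5 * toy$lane_proximity_score

print(toy[, c("station_name", "density_score", "lane_proximity_score", "infrastructure_score")])
```

In this toy case station A (fewest neighbours, farthest from a lane) scores 0 and station E (most neighbours, on a lane) scores 1. Because of the cap, every station at or beyond the 95th-percentile distance gets a lane proximity score of 0.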
This map shows how stations are classified across Manhattan. High infrastructure (green) is concentrated in Lower Manhattan, while low infrastructure (red) is more common in Upper Manhattan.
map1 <- ggplot() +
geom_sf(data = manhattan_poly, fill = "gray95", color = "gray60", linewidth = 0.3) +
geom_sf(data = manhattan_routes_sf, color = "lightblue", alpha = 0.3, linewidth = 0.5) +
geom_point(data = stations,
aes(x = lng, y = lat, color = infrastructure_level, size = nearby_count),
alpha = 0.7) +
scale_color_manual(
values = c("Low" = "#F44336", "Medium" = "#FF9800", "High" = "#4CAF50"),
name = "Infrastructure Level"
) +
scale_size_continuous(name = "Station Density", range = c(1, 4)) +
labs(
title = "Citi Bike Station Infrastructure",
subtitle = "Stations classified by bike lane proximity and station density",
caption = "Size indicates number of nearby stations within 500m"
) +
theme_minimal() +
theme(
panel.grid = element_blank(),
axis.text = element_blank(),
axis.title = element_blank(),
legend.position = "right",
plot.title = element_text(size = 14, face = "bold"),
plot.subtitle = element_text(size = 10, color = "gray40")
)
print(map1)
# Calculate e-bike % by station
station_variation <- citibike_sample[, .(
ebike_pct = mean(rideable_type == "electric_bike") * 100,
trips = .N,
lat = mean(start_lat),
lng = mean(start_lng)
), by = start_station_name]
# Merge with infrastructure
station_full <- merge(station_variation,
stations[, .(station_name, infrastructure_level, infrastructure_score)],
by.x = "start_station_name",
by.y = "station_name",
all.x = TRUE)
# Calculate bike balance (relative to average)
overall_ebike_share <- mean(citibike_sample$rideable_type == "electric_bike") * 100
station_full[, bike_balance := ebike_pct - overall_ebike_share]
cat("Overall e-bike usage:", round(overall_ebike_share, 1), "%\n")
Overall e-bike usage: 68 %
This scatter plot shows the relationship between infrastructure score and e-bike usage at the station level. Each dot is a station, colored by infrastructure level.
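# Plot only stations with an infrastructure score, at least 50 trips, and at least 40% e-bike share
scatter <- ggplot(station_full[!is.na(infrastructure_score) & trips >= 50 & ebike_pct >= 40],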
aes(x = infrastructure_score, y = ebike_pct)) +
geom_point(aes(color = infrastructure_level, size = trips), alpha = 0.6) +
geom_smooth(method = "lm", color = "black", linetype = "dashed", se = TRUE) +
scale_color_manual(values = c("Low" = "#F44336", "Medium" = "#FF9800", "High" = "#4CAF50"),
name = "Infrastructure Level") +
scale_size_continuous(range = c(1, 6), name = "Total Trips") +
labs(
title = "Infrastructure Score vs. E-bike Usage",
subtitle = "Each dot is a station | Dashed line shows overall trend",
x = "Infrastructure Score (higher = better infrastructure)",
y = "E-bike Usage (%)"
) +
theme_minimal() +
theme(plot.title = element_text(size = 14, face = "bold"))
print(scatter)
Correlation between infrastructure and e-bike usage: -0.165
Key Finding: Infrastructure score and e-bike usage are weakly negatively correlated (r = -0.165). Stations with higher infrastructure scores tend to show somewhat lower e-bike usage rates.
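The code behind the reported correlation is not shown in this section. A minimal sketch of how a station-level Pearson correlation and its uncertainty could be checked is given below. It runs on simulated data (the `sim` data frame and its coefficients are hypothetical), so its numbers will not match the -0.165 reported here.

```r
# Simulated station-level data (hypothetical; not the Citi Bike sample)
set.seed(9750)
n <- 500
sim <- data.frame(infrastructure_score = runif(n))
sim$ebike_pct <- 72 - 5 * sim$infrastructure_score + rnorm(n, sd = 8)

# Pearson correlation with a 95% confidence interval and p-value
ct <- cor.test(sim$infrastructure_score, sim$ebike_pct, method = "pearson")
r <- unname(ct$estimate)

cat("r =", round(r, 3), "\n")
cat("95% CI:", round(ct$conf.int, 3), "\n")
cat("p-value:", signif(ct$p.value, 3), "\n")
cat("Share of variance explained (r^2):", round(r^2, 3), "\n")
```

For reference, r = -0.165 corresponds to r^2 of about 0.027, so a linear fit on infrastructure score would account for under 3% of the station-to-station variance in e-bike share.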
This map shows where e-bikes vs classic bikes dominate across Manhattan, relative to the average (68% e-bike).
balance_map <- ggplot() +
geom_sf(data = manhattan_poly, fill = "gray95", color = "gray60", linewidth = 0.3) +
geom_point(data = station_full[trips >= 50],
aes(x = lng, y = lat, color = bike_balance, size = trips),
alpha = 0.9) +
scale_color_gradientn(
colors = c("#08306b", "#2171b5", "#6baed6", "#f7f7f7", "#fcbba1", "#fb6a4a", "#cb181d"),
values = scales::rescale(c(-60, -30, -10, 0, 10, 25, 40)),
limits = c(-60, 40),
breaks = c(-30, 0, 20),
labels = c("More classic", "Near avg", "More e-bikes"),
name = "Relative E-bike Share"
) +
scale_size_continuous(range = c(2, 7), name = "Station\nTrip Volume", labels = scales::comma) +
labs(
title = "Where Are Stations More or Less E-bike-Heavy?",
subtitle = sprintf("Color shows difference from Manhattan average (~%.0f%% e-bikes): orange = higher, blue = lower", overall_ebike_share),
caption = "Data: Citi Bike Manhattan trips | Stations with 50+ trips shown"
) +
coord_sf(datum = NA) +
theme_minimal() +
theme(
panel.grid = element_blank(),
axis.text = element_blank(),
axis.title = element_blank(),
legend.position = "right",
plot.title = element_text(size = 14, face = "bold"),
plot.subtitle = element_text(size = 10, color = "gray40")
)
print(balance_map)
Key Finding: Lower Manhattan (high infrastructure) shows more classic bike usage (blue), while Upper Manhattan (low infrastructure) shows more e-bike usage (orange).
# Define time periods
citibike_sample[, time_period := fcase(
hour %in% c(6,7,8,9), "Morning Rush",
hour %in% c(10,11,12,13,14,15), "Midday",
hour %in% c(16,17,18,19), "Evening Rush",
hour %in% c(20,21,22,23,0,1,2,3,4,5), "Night"
)]
citibike_sample[, time_period := factor(time_period,
levels = c("Morning Rush", "Midday", "Evening Rush", "Night"))]
# Define volume groups
citibike_sample[, station_volume := .N, by = start_station_name]
citibike_sample[, volume_group := cut(station_volume,
breaks = quantile(station_volume, probs = c(0, 0.33, 0.67, 1)),
labels = c("Low Traffic", "Medium Traffic", "High Traffic"),
include.lowest = TRUE)]
# Calculate e-bike % by rider type, volume, and time
ebike_full <- citibike_sample[!is.na(volume_group) & !is.na(time_period), .(
ebike_pct = mean(rideable_type == "electric_bike") * 100,
trips = .N
), by = .(member_casual, volume_group, time_period)]
ebike_full[, rider_label := ifelse(member_casual == "casual", "Casual Riders", "Members")]
heatmap <- ggplot(ebike_full, aes(x = time_period, y = volume_group, fill = ebike_pct)) +
geom_tile(color = "white", linewidth = 1.5) +
geom_text(aes(label = paste0(round(ebike_pct, 1), "%")),
size = 4.5, fontface = "bold", color = "white") +
facet_wrap(~rider_label) +
scale_fill_gradient2(
low = "#3498db",
mid = "#95a5a6",
high = "#e74c3c",
midpoint = 68,
name = "E-bike\nUsage",
limits = c(60, 80),
breaks = seq(60, 80, 5),
labels = function(x) paste0(x, "%")
) +
labs(
title = "E-bike Usage Patterns Across Multiple Dimensions",
subtitle = "Blue = More classic bikes | Red = More e-bikes | Overall average = 68% e-bikes",
x = "Time of Day",
y = "Station Activity Level",
caption = "Rider Type Distribution: Casual – 17% | Member – 83%"
) +
theme_minimal(base_size = 13) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 11),
axis.text.y = element_text(size = 11),
strip.text = element_text(size = 14, face = "bold"),
plot.title = element_text(size = 16, face = "bold", hjust = 0),
plot.subtitle = element_text(size = 11, color = "gray30"),
plot.caption = element_text(hjust = 0, size = 9),
panel.grid = element_blank(),
legend.position = "right"
)
print(heatmap)
Key Findings:
# Rider type summary
rider_summary <- citibike_sample[, .(
total_trips = .N,
electric_bikes = sum(rideable_type == "electric_bike"),
classic_bikes = sum(rideable_type == "classic_bike")
), by = member_casual]
rider_summary[, `:=`(
pct_electric = round(electric_bikes / total_trips * 100, 1),
pct_classic = round(classic_bikes / total_trips * 100, 1)
)]
cat("=== Rider Type Summary ===\n")
=== Rider Type Summary ===
   member_casual total_trips electric_bikes classic_bikes pct_electric pct_classic
          <char>       <int>          <int>         <int>        <num>       <num>
1:        member     1340439         896332        444107         66.9        33.1
2:        casual      277639         203233         74406         73.2        26.8
=== E-bike Usage by Infrastructure Level ===
infrastructure_level mean_ebike n_stations
<fctr> <num> <int>
1: Low 71.8 242
2: Medium 71.8 250
3: High 66.7 242
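The code that produced the table above is not shown. If `mean_ebike` is an unweighted average of station-level percentages (an assumption), it need not match the trip-weighted overall share of 68%, because busy stations count for more in the overall figure. A base R sketch on made-up data (`toy_stations` is hypothetical) illustrates the difference:

```r
# Hypothetical station-level data: two quiet "Low" and two busy "High" stations
toy_stations <- data.frame(
  infrastructure_level = factor(c("Low", "Low", "High", "High"),
                                levels = c("Low", "High")),
  trips     = c(100, 300, 1000, 3000),
  ebike_pct = c(75, 70, 68, 64)
)

# Unweighted: every station counts equally within its level
unweighted <- aggregate(ebike_pct ~ infrastructure_level,
                        data = toy_stations, FUN = mean)

# Trip-weighted: stations count in proportion to their trip volume
by_level <- split(toy_stations, toy_stations$infrastructure_level)
weighted <- sapply(by_level, function(d) weighted.mean(d$ebike_pct, d$trips))

print(unweighted)
print(round(weighted, 2))
```

In this toy case the unweighted means are 72.5 (Low) and 66 (High), while the trip-weighted means are 71.25 and 65. Reporting both alongside each other makes it clear which one a given summary table uses.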
Infrastructure and e-bike usage have an inverse relationship: high-infrastructure stations average 66.7% e-bike usage, compared with 71.8% at both low- and medium-infrastructure stations.
Geographic pattern: Lower Manhattan (high infrastructure) shows more classic bike usage, while Upper Manhattan (low infrastructure) shows more e-bike usage.
A likely explanation: High-infrastructure areas attract more riders, creating higher demand. E-bikes are more popular, so they are taken first, and by the time many riders arrive only classic bikes remain.
Pattern holds across conditions: Both casual riders and members show the same trends, with evening and low-traffic periods having the highest e-bike availability.
“Does bike-share usage respond more to infrastructure or external factors?”
For bike type choice, infrastructure affects availability indirectly through demand. High-infrastructure areas have high demand, which depletes e-bikes—forcing riders onto classic bikes. This is an infrastructure factor, but driven by usage patterns rather than bike lanes themselves.
Conclusion: Infrastructure doesn’t directly determine bike type preference, but it shapes availability by concentrating demand. Riders in high-infrastructure areas may want e-bikes but are forced onto classic bikes because of supply constraints.